Distributed Vector Architecture: Beyond a Single Vector-IRAM

Authors

  • Stefanos Kaxiras
  • Rabin Sugumar
  • James Schwarzmeier
Abstract

Integration of memory on the same die as the processor (IRAM) has the potential to offer unprecedented bandwidth that can be exploited efficiently by vector processors. However, real-world scientific vector applications, with their very large memory requirements and poor locality, would easily overflow any single IRAM device. In this environment, traditional approaches such as caching or paging generate considerable traffic, diminishing the performance advantage of processor-memory integration. To exploit the full potential of IRAM in large-scale scientific computing, we propose a DIstributed Vector Architecture (DIVA) that uses multiple vector-capable IRAM nodes in a distributed shared-memory configuration. The advantages of our approach are twofold: (i) we speed up the execution of vector instructions by parallelizing them across the nodes, and (ii) we reduce external traffic by bringing computation to data rather than data to computation. We dynamically map the computation of individual vector instructions onto nodes so that it coincides, to the extent possible, with the corresponding data in memory. As an implementation, we propose a mechanism that assigns elements of the architectural vector registers to nodes at run time, using the layout of data in memory as a blueprint. Using traces of vector supercomputer programs, we demonstrate that DIVA often generates considerably less external traffic than single-node or multiple-node alternatives that are based solely on caching or paging. Considerable performance gains are then possible because of DIVA's inter-node parallelism.
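The idea of using the memory layout as a blueprint for placing vector register elements can be illustrated with a small sketch. It is not the mechanism from the paper: the page size, node count, element width, round-robin page interleaving, and the function names `home_node`, `map_register`, and `remote_elements` are all assumptions chosen for illustration.

```python
from collections import Counter

PAGE_BYTES = 4096  # assumed page size (not taken from the paper)
NUM_NODES = 4      # assumed number of IRAM nodes
ELEM_BYTES = 8     # assumed 64-bit vector elements


def home_node(addr, page_bytes=PAGE_BYTES, num_nodes=NUM_NODES):
    """Hypothetical layout: pages are interleaved round-robin across nodes."""
    return (addr // page_bytes) % num_nodes


def map_register(base, stride, vlen):
    """Place each element of a vector register on the node that holds its data.

    base is a byte address, stride is counted in elements.
    """
    return [home_node(base + i * stride * ELEM_BYTES) for i in range(vlen)]


def remote_elements(map_a, map_b):
    """Count element pairs whose operands sit on different nodes.

    Each such pair would need one inter-node transfer before an
    element-wise operation on the two registers could proceed.
    """
    return sum(1 for a, b in zip(map_a, map_b) if a != b)


# Unit-stride load of 1024 elements starting at address 0 (pages 0 and 1).
a = map_register(base=0, stride=1, vlen=1024)
# Same shape, starting four pages later: lands on the same nodes as `a`.
b_aligned = map_register(base=4 * PAGE_BYTES, stride=1, vlen=1024)
# Same shape, starting two pages later: lands on different nodes.
b_shifted = map_register(base=2 * PAGE_BYTES, stride=1, vlen=1024)

print(Counter(a))
print(remote_elements(a, b_aligned), remote_elements(a, b_shifted))
```

Under these assumed parameters, operands whose pages fall on the same nodes need no inter-node transfers, while a two-page offset makes every element pair remote. This is the kind of external traffic that a layout-driven mapping tries to avoid.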

Similar resources

A Media-Enhanced Vector Architecture for Embedded Memory Systems

Next generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. This paper presents the architecture of Vector IRAM (VIRAM), a processor that combines vector pro...

Distributed vector architectures

Integrating processors and main memory is a promising approach to increase system performance. Such integration provides very high memory bandwidth that can be exploited efficiently by vector operations. However, traditional vector applications would easily overflow the limited memory of a single integrated node. To accommodate such workloads, we propose the DIstributed Vector Architecture (DIV...

For Embedded Applications with Data-level Parallelism, a Vector Processor Offers High Performance at Low Power Consumption and Low Design Complexity. Unlike Superscalar and VLIW Designs, a Vector Processor Is Scalable and Can Optimally Match Specific

Designers of embedded processors have typically optimized for low power consumption and low design complexity to minimize cost. Performance was a secondary consideration. Nowadays, many embedded systems (set-top boxes, game consoles, personal digital assistants, and cell phones) commonly perform computation-intensive media tasks such as video processing, speech transcoding, graphics, and high-b...

Hardware/Compiler Co-development for an Embedded Media Processor

Embedded and portable systems running multimedia applications create a new challenge for hardware architects. The microprocessor needed for such systems is a merged general-purpose processor and digital-signal processor, with the programmability of the former and the performance and power budget of the latter. This paper presents the co-development of the instruction set, the hardware, and the com...

Image Segmentation on IRAM

The Computer Vision group at U.C. Berkeley recently developed a novel approach to image segmentation, called the Normalized Cuts algorithm. The current implementation of the algorithm has an execution time on the order of minutes for medium-sized images running on conventional scalar machines. This paper explores the current bottlenecks and seeks to maximize the performance by porting the algor...

Publication date: 1997